[BugFix] fix tcgen5mma example #1577
Conversation
👋 Hi! Thank you for contributing to the TileLang project. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
📝 Walkthrough
The block_N parameter is reduced from 256 to 128 in the GEMM kernel configuration, and the A_shared and B_shared buffer shapes are adjusted accordingly. Loop logic and the GEMM operations remain unchanged. Inline comments are added to the T.copy statements for clarity.
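For reference, the shape relationship the walkthrough describes can be sketched as a generic TileLang tile-level GEMM. This is an illustrative reconstruction under assumed names and parameters (`matmul`, `block_M`, `block_K`, dtypes, thread count), not the actual gemm_tcgen5mma.py kernel.

```python
# Minimal TileLang-style sketch of the revised tile configuration
# (block_M = block_K = 128, block_N reduced from 256 to 128).
# Illustrative only: names, dtypes, and thread count are assumptions,
# and the real SM100 tcgen5mma example uses additional features.
import tilelang.language as T


def matmul(M, N, K, block_M=128, block_N=128, block_K=128,
           dtype="float16", accum_dtype="float"):

    @T.prim_func
    def main(
            A: T.Tensor((M, K), dtype),   # A is row-major (trans_A=False)
            B: T.Tensor((N, K), dtype),   # B is column-major (trans_B=True)
            C: T.Tensor((M, N), dtype),
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M),
                      threads=128) as (bx, by):
            # Shared tiles shrink along N now that block_N = 128.
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_N, block_K), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)

            T.clear(C_local)
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=2):
                T.copy(A[by * block_M, ko * block_K], A_shared)
                T.copy(B[bx * block_N, ko * block_K], B_shared)
                T.gemm(A_shared, B_shared, C_local, transpose_B=True)
            T.copy(C_local, C[by * block_M, bx * block_N])

    return main
```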
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 0
🧹 Nitpick comments (1)
examples/gemm_sm100/gemm_tcgen5mma.py (1)
41-42: Consider clarifying the inline comments.
The comments `# not trans_A` and `# trans_B` accurately reflect the configuration values, but they could be more descriptive about what they document. Consider rephrasing to clarify whether they describe the data layout or the configuration, for example: `# A is row-major (trans_A=False)` and `# B is column-major (trans_B=True)`.
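Applied to the two copies, the suggested wording would read roughly as follows; the indexing and buffer names are illustrative, not copied from the example file.

```python
# Illustrative only; indices and buffer names are assumptions, not the
# exact lines from examples/gemm_sm100/gemm_tcgen5mma.py.
T.copy(A[by * block_M, ko * block_K], A_shared)  # A is row-major (trans_A=False)
T.copy(B[bx * block_N, ko * block_K], B_shared)  # B is column-major (trans_B=True)
```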
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/gemm_sm100/gemm_tcgen5mma.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.
🔇 Additional comments (2)
examples/gemm_sm100/gemm_tcgen5mma.py (2)
55-55: LGTM! The block size reduction addresses the bug.
Reducing `block_N` from 256 to 128 significantly reduces shared memory usage per block (from ~160 KB to ~96 KB), which likely addresses hardware constraints or memory limitations specific to TensorCore Gen5 MMA operations. The dependent buffer shapes (A_shared, B_shared, C_shared) are properly adjusted by the existing logic.
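The quoted totals are consistent with 2-byte (fp16) shared buffers at block_M = block_K = 128, which is an assumption inferred from the example configuration rather than stated in the review:

$$
\begin{aligned}
A_\text{shared} &: 128 \times 128 \times 2\,\mathrm{B} = 32\,\mathrm{KB} \\
B_\text{shared} &: 256 \times 128 \times 2\,\mathrm{B} = 64\,\mathrm{KB} \;\to\; 128 \times 128 \times 2\,\mathrm{B} = 32\,\mathrm{KB} \\
C_\text{shared} &: 128 \times 256 \times 2\,\mathrm{B} = 64\,\mathrm{KB} \;\to\; 128 \times 128 \times 2\,\mathrm{B} = 32\,\mathrm{KB} \\
\text{Total} &: \approx 160\,\mathrm{KB} \;\to\; \approx 96\,\mathrm{KB}
\end{aligned}
$$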
58-58: No action needed. The transition from `num_stages = 0` to `num_stages = 2` is intentional behavior by design. The conditional logic explicitly enables 2-stage pipelining when all block dimensions are smaller than 256, and disables it when any dimension is 256 or larger. With the new tile configuration (128×128×128), this results in the correct `num_stages = 2` setting. The code includes validation tests that confirm correct compilation and numerical accuracy.
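A minimal paraphrase of that selection rule, assuming the three tile dimensions above; the exact expression in the example may differ.

```python
# Hedged paraphrase of the stage-selection rule described in the comment above;
# the exact expression in gemm_tcgen5mma.py may differ.
block_M, block_N, block_K = 128, 128, 128

# Enable 2-stage pipelining only when every block dimension is below 256.
num_stages = 2 if all(dim < 256 for dim in (block_M, block_N, block_K)) else 0

assert num_stages == 2  # the new 128x128x128 tiles enable pipelining
```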
Summary by CodeRabbit
Documentation
Chores